Load the data for a single patient

source("code.R")
source("cna.R")
library("kableExtra")
options(dplyr.summarise.inform = FALSE)
PDID=params$PDID
treemodel=params$TREEMODEL
MUTCOUNTBIN="./MUTCOUNTBINS.RDS"
bins=readRDS(MUTCOUNTBIN)
PD=readRDS("PD6629.RDS")
# Pathogenic substitution types
CV=c("missense","nonsense","ess_splice","frameshift","inframe","cna","loh","start_lost","nc_ess_splice","stop_lost","complex_sub")
##  Read in list of genes of interest (from other pubs or dnds - here taken from prognostics).
GENES=readLines("GENES.txt")
refcolony=PD$refcolony
CENSOR_GERMLINE=TRUE

The PD data structure:

PD6629 Sample Summary

patient ID age_at_sample_exact cell_type phase BaitLabel
3 PD6629 PD6629dc 55.05270 PB whole Recapture PD6629dc
4 PD6629 PD6629dd 58.83094 PB Gran Recapture PD6629dd
1 PD6629 COLONY60 60.26831 BFU-E-Colony Colony NA
5 PD6629 PD6629de 60.26831 PB Gran Recapture PD6629de
2 PD6629 COLONY62 62.28063 BFU-E-Colony Colony NA
6 PD6629 PD6629df 63.41136 PB Gran Recapture PD6629df
7 PD6629 PD6629dg 64.73374 PB Gran Recapture PD6629dg
8 PD6629 PD6629dh 65.86448 PB Gran Recapture PD6629dh

Tree

tree=plot_basic_tree(PD$pdx,label = PD$patient,style="classic",cv=CV,genes=GENES)

Expanded Tree with Node Labels

The nodes in this plot can be cross-referenced with nodes specified in subsequent results. The plot also serves to give an idea of what the topology at the top of the tree looks like.

tree=plot_basic_tree(expand_short_branches(PD$pdx,prop = 0.1),label = PD$patient,style="classic",cv=CV,genes=GENES)
node_labels(tree)

Timing of driver mutations (using Model = poisson_tree )

Note that the different colours on the tree indicate the separately fitted mutation rate clades.

##Run null and alt models
NITER=2000
##The following populates a list PD$fit$<treemodel>$<altmodel|nullmodel>.  The function wraptreefit wraps a call to the rtreefit package.
PD=wraptreefit(PD,niter=NITER,b.fit.null = FALSE,method=treemodel)
PD=wraptreefit(PD,niter=NITER,b.fit.null = TRUE,method=treemodel)
par(mfcol=c(1,2))
tree=plot_tree(PD$fit[[treemodel]]$nullmodel$ultratree,cex.label=0);title(sprintf("%s:Ultrametric tree with single rate fitted",PD$patient))
tree=plot_tree(get_colored_markup_tree(PD$fit[[treemodel]]$altmodel$ultratree,PD$nodes$node[PD$nodes$status>=0]),cex.label=0);title(sprintf("%s:Ultrametric tree with multiple rates fitted",PD$patient))

Driver Specific Mutation Rates & Telomere Lengths by Colony & Timepoint

node driver status child_count type colony_count mean_lambda_rescaled correction sd_rescaled lb_rescaled ub_rescaled median_rescaled p_lt_wt
-1 WT 1 -1 local 1 17.89996 1.014746 1.0455655 15.992666 20.04496 17.88861 NA
63 DNMT3A 1 56 local 21 18.30081 1.014746 0.3208797 17.689680 18.96030 18.29556 0.35475
68 JAK2:DNMT3A 1 35 local 25 18.05795 1.014746 0.5876457 16.983898 19.22153 18.04647 0.44325
59 DNMT3A 0 1 local 1 20.23810 1.014746 2.9295554 14.129344 26.67156 20.01563 0.21225
58 CBL 0 1 local 1 14.67284 1.014746 2.5882421 9.164435 20.11756 14.84231 0.89775
88 9pUPD:JAK2:DNMT3A 0 2 local 2 19.00392 1.014746 1.3265712 16.560183 21.93145 18.90989 0.26450
76 TET2:JAK2:DNMT3A 1 8 local 8 17.20811 1.014746 0.9930776 15.312261 19.26226 17.18853 0.67675

Driver Acquisition Timeline

All ages are in terms of post conception years. The vertical red lines denote when colonies were sampled and blue lines when targeted follow up samples were taken.

patient node driver child_count lower_median upper_median lower_lb95 lower_ub95 upper_lb95 upper_ub95 N group age_at_diagnosis_pcy max_age_at_sample min_age_at_sample
PD6629 63 DNMT3A 56 0.0087052 7.021722 0.0061889 0.0223216 5.960489 8.224788 8 DNMT3A 54.8063 66.59274 55.78097
PD6629 68 JAK2 35 14.0861319 29.826768 12.7732820 15.4934830 28.750732 30.938651 8 JAK2 54.8063 66.59274 55.78097
PD6629 88 9pUPD 2 32.9166133 46.100157 31.8524016 33.9989184 44.719007 47.515371 8 9pUPD 54.8063 66.59274 55.78097
PD6629 76 TET2 8 36.4302883 46.652268 35.3107168 37.5932440 45.653070 47.581811 8 TET2 54.8063 66.59274 55.78097
patient node driver child_count lower_median upper_median lower_lb95 lower_ub95 upper_lb95 upper_ub95 N group age_at_diagnosis_pcy max_age_at_sample min_age_at_sample
PD6629 63 DNMT3A 56 0.0086408 7.262048 0.0061942 0.02191 5.951499 8.765735 8 DNMT3A 54.8063 66.59274 55.78097
PD6629 68 JAK2 35 14.2510600 29.707781 12.6407896 15.93306 27.818446 31.597035 8 JAK2 54.8063 66.59274 55.78097
PD6629 76 TET2 8 36.1437225 46.025114 34.3245419 37.89799 44.231836 47.729015 8 TET2 54.8063 66.59274 55.78097
PD6629 88 9pUPD 2 32.7786694 46.309379 30.9287726 34.59708 44.321100 48.149647 8 9pUPD 54.8063 66.59274 55.78097

Methods for timing LOH and amplification

The timing methods require an rough approximation for the number of expected detectable mutations, \(L\), in the LOH/CNA region for the duration of the branch.

Background local mutation rate

Firstly we estimate a local relative somatic mutation rate for mutations detectable by CaveMan in autosomal regions. The rate is measured across a panel of samples consisting of all those colonies in all 12 patients that do not exhibit copy number aberrations. The genome is divided into 100Kb bins and the number of passed somatic mutations is counted across all samples in the panel, to give a count \(c_i\) for bin \(b_i\). The probability that a given mutation occurs in bin \(i\) is estimated by \(p(mut \in b_i)=\frac{c_i}{\sum_j c_j}\) and standard error \(p_{se}=\frac{p(mut \in b_i)}{\sqrt{c_j}}\). For a given copy number region \(C\) then \(p(mut \in C)=\sum_{b_i \in C} p(mut \in b_i)\). For a branch of duration \(t\) and with global mutation rate \(\lambda\) then \(E(L)=\lambda t p(mut \in C)\) and \(Var(L)=\lambda^2 t^2 \sum_{b_i \in C} p_{se}(mut \in b_i)^2\) where here we are not modelling errors in \(t\) and \(\lambda\).

Copy number neutral LOH timing

All somatic mutations that occur prior to the LOH event, occurring a fraction \(x\) along the branch, will be homozygous with detection sensitivity \(s_{\text{HOM}}\) and those after will be heterozygous with detection sensitivity \(s_{\text{HET}}\). We model the mutations as arriving at a constant rate along the branch and fit the following model for \(x\) :

\[ N_{\text{HET}} \sim \text{Poisson} \left( (1-x) L s_{\text{HET}} \right) \] \[ N_{\text{HOM}} \sim \text{Poisson} \left( x L s_{\text{HOM}} \right) \]

with priors \(x \sim \text{Uniform}(0,1)\) and \(L \sim \mathcal{N}(E(L),Var(L))\) and where \(s_{\text{HOM}}=0.5\) (assuming perfect detection of homozygous mutant variants) and \(s_{\text{HET}}\) is estimated from germline SNPs as previously discussed.

Timing duplications

Somatic mutations that occur prior to the CNA event, occurring a fraction \(x\) along the branch, have an equal chance of exhibiting VAF=1/3 or VAF=2/3, whereas those occurring after the event will always have VAF=1/3.

\[ N_{2/3} \sim \text{Poisson} \left( \frac{x L s_{1/3} }{2} \right) \] \[ N_{1/3} \sim \text{Poisson} \left( s_{2/3} \left( \frac{x L }{2} + (1-x) L \right) \right) \]

Where the priors are as in the LOH model above. The detection sensitivities \(s_{1/3}\) and \(s_{2/3}\) are approximated by \(s_{het}\) because of the additional sequencing depth afforded by the duplication, all though it is acknowleged that this approach is likely to underestimate the \(s_{2/3}\). Unlike the LOH case, the value of \(x\) will be relatively unaffected by \(L\) because of the similarity of \(s_{1/3}\) and \(s_{2/3}\).

Classifying variants

The Bernoulli trial probability of the observing the mutant allele on a given read given that true VAF is \(\mu\) and assuming an error rate \(\epsilon\) is as follows:

\[ p(mut)=p(mut|\text{mutant read})p(\text{mutant read})+p(mut|\text{wild type read})p(\text{wild type read}) \] \[ p(mut)=(1-\epsilon)\mu+\frac{\epsilon}{3}(1-\mu) \]

Where in the above "mutant read" etc denotes that the read originated from a mutant chromosome.

Where a variant can be classified as having one of 2 VAFs (\(\mu_1\) or \(\mu_2\), with each having same prior probability:

\[ p(VAF=\mu_1|data)=\frac{p_{\text{binom}}(k=m|n=depth,p=(1-\epsilon)\mu_1+\frac{\epsilon}{3}(1-\mu_1))}{p_{\text{binom}}((k=m|n=depth,p=(1-\epsilon)\mu_1+\frac{\epsilon}{3}(1-\mu_1))+p_{\text{binom}}(k=m|n=depth,p=(1-\epsilon)\mu_2+\frac{\epsilon}{3}(1-\mu_2))} \]

If desired this can be further simplified; setting \(\mu^{*}_{1}=(1-\epsilon)\mu_1+\frac{\epsilon}{3}(1-\mu_1))\) and \(\mu^{*}_{2}=(1-\epsilon)\mu_2+\frac{\epsilon}{3}(1-\mu_2))\) then

\[ p(VAF=\mu_1|data)=\frac{\mu^{*m}_{1} (1-\mu^{*}_{1})^{d-m}}{\mu^{*m}_{1} (1-\mu^{*}_{1})^{d-m}+\mu^{*m}_{2} (1-\mu^{*}_{2})^{d-m}} \]

We use the above to classify both duplicated (VAF=2/3) vs unduplicated (VAF=1/3) and also heterozygous(VAF=0.5) vs homozygous mutant (VAF=1). Only variants where one of these probabilities exceeds 0.95 is classified, the others remain unclassified and are not included in the model.

Note that when applied to shared branches we typically have high accuracy in classifying variants because we consider reads that are aggregated across all colonies that share the branch.

Background Model

library(karyoploteR)
## Warning: package 'karyoploteR' was built under R version 4.0.3
## Warning: package 'regioneR' was built under R version 4.0.3
## Warning: package 'GenomicRanges' was built under R version 4.0.3
## Warning: package 'BiocGenerics' was built under R version 4.0.5
## Warning: package 'S4Vectors' was built under R version 4.0.3
## Warning: package 'IRanges' was built under R version 4.0.3
## Warning: package 'GenomeInfoDb' was built under R version 4.0.5
gr=GRanges(sprintf("chr%s",bins$chr), IRanges(start=bins$start,end=bins$end))
mcols(gr)=data.frame(y=bins$prob)
ymax=max(bins$prob)
ymin=0
kp <- plotKaryotype(chromosomes=sprintf("chr%s",1:22),plot.type=1)
kpAbline(kp, h=0, ymin=ymin, ymax=ymax, lty=2, col="#666666")
kpAxis(kp, ymin = ymin, ymax=ymax)
kpLines(kp, data=gr, ymin=ymin, ymax=ymax)

Summary of LOH timing inference

We apply the above model to any copy number neutral LOH events in the tree:

hethomres=get_hethom(PD,refcolony,b_do_plot=TRUE)
## Warning in FUN(X[[i]], ...): Germline info is censored so not plotting BAFs..

## Number of muts overlapping loh shared in tree = 2
cnatimings=hethomres$hethom
cnatimings2=cnatimings %>% left_join(globaltimings,by="node")
cnatimings2$mean_loh_event=cnatimings2$xmean*(cnatimings2$upper_mean-cnatimings2$lower_mean)+cnatimings2$lower_mean
cnatimings2$lower_loh_event=cnatimings2[["x2.5."]]*(cnatimings2$upper_mean-cnatimings2$lower_mean)+cnatimings2$lower_mean
cnatimings2$upper_loh_event=cnatimings2[["x97.5."]]*(cnatimings2$upper_mean-cnatimings2$lower_mean)+cnatimings2$lower_mean
cnatimings2$t_before_end=(1-cnatimings2$xmean)*(cnatimings2$upper_mean-cnatimings2$lower_mean)
cnatimings2$t_before_end_lower=(1-cnatimings2[["x97.5."]])*(cnatimings2$upper_mean-cnatimings2$lower_mean)
cnatimings2$t_before_end_upper=(1-cnatimings2[["x2.5."]])*(cnatimings2$upper_mean-cnatimings2$lower_mean)
kable(cnatimings2, "html") %>% kable_styling("striped") %>% scroll_box(width="95%")
nhet nhom label node het.sensitivity chr start end kb count_in_bin count_se pmut pmut_se xmean xse_mean xsd x2.5. x50. x97.5. xn_eff xRhat lmean lse_mean lsd l2.5. l50. l97.5. ln_eff lRhat patient driver status driver2 driver3 child_count chknode lower_lb95 lower_ub95 lower_median lower_mean upper_lb95 upper_ub95 upper_median upper_mean lower_mutcount_adj upper_mutcount_adj lower_mutcount upper_mutcount group N max_age_at_sample min_age_at_sample age_at_diagnosis_pcy totcolonies col mean_loh_event lower_loh_event upper_loh_event t_before_end t_before_end_lower t_before_end_upper
0 2 LOH_9pUPD 88 0.9102877 9 0 33393006 33300000 6654 81.57205 0.0145321 0.0001782 0.7988957 0.0010226 0.1685979 0.3719991 0.8433913 0.9943569 27184.2 0.9999445 3.421252 0.0002609 0.0417497 3.339451 3.42126 3.502349 25616.1 1.000053 PD6629 9pUPD 0 9pUPD 9pUPD:JAK2:DNMT3A 2 88 31.8524 33.99892 32.91661 32.91223 44.71901 47.51537 46.10016 46.09592 624.009 872.7397 613 836 9pUPD 8 66.59274 55.78097 54.8063 61 #4DAF4A 43.44463 37.81655 46.02153 2.651298 0.0743964 8.279374

VAF Distribution of Targeted Follow Up Samples

No drivers to add to tree!

No drivers to add to tree!

No drivers to add to tree!

No drivers to add to tree!

No drivers to add to tree!

No drivers to add to tree!

No drivers to add to tree!

Here we exclude all local CNAs and depict as color VAF plots